Abstract

We introduce a new framework for unsupervised learning of representations based on a novel hierarchical decomposition of information. Intuitively, data is passed through a series of progressively fine-grained sieves. Each layer of the sieve recovers a single latent factor that is maximally informative about multivariate dependence in the data. The data is transformed after each pass so that the remaining unexplained information trickles down to the next layer. Ultimately, we are left with a set of latent factors explaining all the dependence in the original data and remainder information consisting of independent noise. We present a practical implementation of this framework for discrete variables and apply it to a variety of fundamental tasks in unsupervised learning including independent component analysis, lossy and lossless compression, and predicting missing values in data.

The hope of finding a succinct principle that elucidates the brain’s information processing abilities has often kindled interest in information-theoretic ideas (Barlow, 1989; Simoncelli & Olshausen, 2001). In machine learning, on the other hand, the past decade has witnessed a shift in focus toward expressive, hierarchical models, with successes driven by increasingly effective ways to leverage labeled data to learn rich models (Schmidhuber, 2015; Bengio et al., 2013). Information-theoretic ideas like the venerable InfoMax principle (Linsker, 1988; Bell & Sejnowski, 1995) can be and are applied in both contexts with empirical success, but they do not allow us to quantify the information value of adding depth to our representations. We introduce a novel incremental and hierarchical decomposition of information and show that it defines a framework for unsupervised learning of deep representations in which the information contribution of each layer can be precisely quantified. Moreover, this scheme automatically determines the structure and depth among hidden units in the representation based only on local learning rules.

The shift in perspective that enables our information decomposition is to focus on how well the learned representation explains multivariate mutual information in the data (a measure originally introduced as “total correlation” (Watanabe, 1960)). Intuitively, our approach constructs a hierarchical representation of data by passing it through a sequence of progressively fine-grained sieves. At the first layer of the sieve we learn a factor that explains as much of the dependence in the data as possible. The data is then transformed into the “remainder information”, which has this dependence extracted. The next layer of the sieve looks for the largest source of dependence in the remainder information, and the cycle repeats. At each step, we obtain successively tighter upper and lower bounds on the multivariate information in the data, with convergence between the bounds obtained when the remaining information consists of nothing but independent factors. Because we end up with independent factors, one can also view this decomposition as a new way to do independent component analysis (ICA) (Comon, 1994; Hyvarinen & Oja, 2000). Unlike traditional methods, we do not assume a specific generative model of the data (i.e., that it consists of a linear transformation of independent sources) and we extract independent factors incrementally rather than all at once. The implementation we develop here uses only discrete variables and is therefore most relevant for the challenging problem of ICA with discrete variables, which has applications to compression (Painsky et al., 2014).

After providing some background in Sec. 1, we introduce a new way to iteratively decompose the information in data in Sec. 2, and show how to use these decompositions to define a practical and incremental framework for unsupervised representation learning in Sec. 3. We demonstrate the versatility of this framework by applying it first to independent component analysis (Sec. 4). Next, we use the sieve as a lossy compression to perform tasks typically relegated to generative models including in-painting and generating new samples (Sec. 5). Finally, we cast the sieve as a lossless compression and show that it beats standard compression schemes on a benchmark task (Sec. 6).

1 Information-theoretic learning background

Using standard notation (Cover & Thomas, 2006), capital $X_i$ denotes a random variable taking values in some domain, and its instances are denoted in lowercase, $x_i$. In this paper, the domains of all variables are considered to be discrete and finite. We abbreviate multivariate random variables, $X \equiv X_{1:n} \equiv X_1, \ldots, X_n$, with an associated probability distribution, $p_X(X_1 = x_1, \ldots, X_n = x_n)$, which is typically abbreviated to $p(x)$. We will index different groups of multivariate random variables with superscripts, $X^k$, as defined in Fig. 1. We let $X^0$ denote the original observed variables and we often omit the superscript in this case for readability.

Entropy is defined in the usual way as $H(X) \equiv E_X[\log 1/p(x)]$. We use base two logarithms so that the unit of information is bits. Higher-order entropies can be constructed in various ways from this standard definition. For instance, the mutual information between two groups of random variables, $X$ and $Y$, can be written as the reduction of uncertainty in one variable, given information about the other, $I(X; Y) = H(X) - H(X|Y)$.
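These definitions can be checked numerically on small samples. The following is a minimal sketch using plug-in (empirical frequency) estimates; the function names and the toy data are illustrative assumptions, not part of the original implementation.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in estimate of H(X) in bits from a list of hashable observations."""
    counts = Counter(samples)
    n = len(samples)
    return sum((c / n) * log2(n / c) for c in counts.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) - H(X|Y), with H(X|Y) = H(X,Y) - H(Y)."""
    h_x_given_y = entropy(list(zip(xs, ys))) - entropy(ys)
    return entropy(xs) - h_x_given_y

# Toy data (assumed): Y is an exact copy of a fair bit X.
xs = [0, 1, 0, 1]
ys = [0, 1, 0, 1]
print(entropy(xs), mutual_information(xs, ys))
```

When Y copies X, conditioning on Y removes all uncertainty about X, so the mutual information equals the entropy of X.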

The “InfoMax” principle (Linsker, 1988; Bell & Sejnowski, 1995) suggests that for unsupervised learning we should construct $Y$’s to maximize their mutual information with $X$, the data. Despite its intuitive appeal, this approach has several potential problems (see (Ver Steeg et al., 2014) for one example). Here we focus on the fact that the InfoMax principle is not very useful for characterizing “deep representations”, even though it is often invoked in this context (Vincent et al., 2008). This follows directly from the data processing inequality (a similar argument appears in (Tishby & Zaslavsky, 2015)). Namely, if we start with $X$, construct a layer of hidden units $Y^1$ that are a function of $X$, and continue adding layers to a stacked representation so that $X \to Y^1 \to Y^2 \to \cdots \to Y^k$, then the information that the $Y$’s have about $X$ cannot increase after the first layer, $I(X; Y^{1:k}) = I(X; Y^1)$. From the point of view of mutual information, $Y^1$ is a copy and $Y^2$ is just a copy of a copy. While a coarse-grained copy might be useful, the InfoMax principle does not quantify how or why.
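The equality for stacked layers follows from the chain rule for mutual information together with the Markov property of the stack. A short worked step, using only the definitions above:

```latex
\begin{align*}
I(X; Y^{1:k}) &= I(X; Y^1) + I(X; Y^{2:k} \mid Y^1) \\
              &= I(X; Y^1),
\end{align*}
```

The second line holds because the higher layers depend on $X$ only through $Y^1$, so the conditional term $I(X; Y^{2:k} \mid Y^1)$ is zero.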

Instead of looking for a $Y$ that memorizes the data, we shift our perspective to searching for a $Y$ so that the $X_i$’s are as independent as possible conditioned on this $Y$. Essentially, we are trying to reconstruct the latent factors that are the cause of the dependence in $X_i$. To formalize this, we use the multivariate mutual information, which was first introduced as “total correlation” (Watanabe, 1960).


$$TC(X) \equiv D_{KL}\Big(p(x) \,\Big\|\, \prod_{i=1}^n p(x_i)\Big) = \sum_{i=1}^n H(X_i) - H(X) \qquad (1)$$


This quantity reflects the dependence in $X$ and is zero if and only if the $X_i$’s are independent. Just as mutual information is the reduction of entropy in $X$ after conditioning on $Y$, we can define the reduction in multivariate information in $X$ after conditioning on $Y$.


$$TC(X; Y) \equiv TC(X) - TC(X|Y) = \sum_{i=1}^n I(X_i; Y) - I(X; Y) \qquad (2)$$
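Eqs. 1 and 2 translate directly into code for small discrete datasets. The sketch below uses plug-in entropy estimates; the helper names and the four-row dataset (two identical fair bits plus one independent fair bit) are hypothetical choices made for illustration.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in entropy estimate in bits."""
    counts = Counter(samples)
    n = len(samples)
    return sum((c / n) * log2(n / c) for c in counts.values())

def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def total_correlation(rows):
    """TC(X) = sum_i H(X_i) - H(X), Eq. 1; rows is a list of tuples."""
    columns = list(zip(*rows))
    return sum(entropy(col) for col in columns) - entropy(rows)

def tc_explained(rows, ys):
    """TC(X;Y) = sum_i I(X_i;Y) - I(X;Y), Eq. 2."""
    columns = list(zip(*rows))
    per_variable = sum(mutual_information(col, ys) for col in columns)
    return per_variable - mutual_information(rows, ys)

# Hypothetical data: X_1 = X_2 is a fair bit, X_3 is an independent fair bit.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
ys = [r[0] for r in rows]  # candidate factor Y = X_1
print(total_correlation(rows), tc_explained(rows, ys))
```

In this dataset the only dependence is the shared bit between the first two variables, and a factor that copies that bit accounts for all of it.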


That $TC(X)$ can be hierarchically decomposed in terms of short and long range dependencies was already appreciated by Watanabe (Watanabe, 1960) and has been used in applications such as hierarchical clustering (Kraskov et al., 2005). This provides a hint about how higher levels of hierarchical representations can be useful: more abstract representations should reflect longer range dependencies in the data. Our contribution below is to demonstrate a tractable approach for learning a hierarchy of latent factors, $Y$, that exactly capture the multivariate information in $X$.

2 Incremental information decomposition 

We consider any set of probabilistic functions of some input variables, $X$, to be a “representation” of $X$. Looking at Fig. 1(a), we consider a representation with a single learned latent factor, $Y$. Then, we try to save the information in $X$ that is not captured by $Y$ into the “remainder information”, $\bar X$. The final result is encapsulated in Cor. 2.4, which says that we can repeat this procedure iteratively (as in Fig. 1(b)) and $TC(X)$ decomposes into a sum of non-negative contributions from each $Y_k$. Note that $X^k$ includes $Y_k$, so that $Y$’s at subsequent layers can depend on latent factors learned at earlier layers.

Theorem 2.1. Incremental Decomposition of Information Let $Y$ be some (deterministic) function of $X_1, \ldots, X_n$ and let $\bar X_i$ be a probabilistic function of $X_i, Y$, for each $i = 1, \ldots, n$. Then the following upper and lower bounds on $TC(X)$ hold:

$$A \le TC(X) - \big(TC(\bar X) + TC(X; Y)\big) \le B \qquad (3)$$

$$A = -\sum_{i=1}^n I(\bar X_i; Y), \qquad B = \sum_{i=1}^n H(X_i \mid \bar X_i, Y)$$

Figure 1. (a) This diagram describes one layer of the information sieve. In this graphical model, the variables in the top layer ($X_i$’s) represent (observed) input variables. $Y$ is some function of all the $X_i$’s that is optimized to be maximally informative about multivariate dependence in $X$. The remainder information, $\bar X_i$, depends on $X_i$ and $Y$ and is set to contain information in $X_i$ that is not captured by $Y$. (b) Summary of variable naming scheme for multiple layers of the sieve. The input variables are in bold and the learned latent factors are in red.

A proof is provided in App. B. Note that the remainder information, $\bar X \equiv \bar X_1, \ldots, \bar X_n, Y$, includes $Y$. Bounds on $TC(X)$ also provide bounds on $H(X)$ by using Eq. 1. Next, we point out that the remainder information, $\bar X$, can be chosen to make these bounds tight.

Lemma 2.2. Construction of perfect remainder information For discrete, finite random variables $X_i, Y$ drawn from some distribution, $p(X_i, Y)$, it is possible to define another random variable $\bar X_i \sim p(\bar x_i \mid x_i, y)$ that satisfies the following two properties:

(i) $I(\bar X_i; Y) = 0$   Remainder contains no $Y$ info

(ii) $H(X_i \mid \bar X_i, Y) = 0$   Original is perfectly recoverable

We give a concrete construction in App. C. We would like to point out one caveat here. The cardinality of $\bar X_i$ may have to be large to satisfy these equalities. For a fixed number of samples, this may cause difficulties with estimation, as discussed in Sec. 3. With perfect remainder information in hand, our decomposition becomes exact.
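To see why, substitute properties (i) and (ii) of Lemma 2.2 into the bounds of Theorem 2.1. Both bounding terms vanish, which pins the middle expression to zero:

```latex
\begin{align*}
A &= -\sum_{i=1}^n I(\bar X_i; Y) = 0, \qquad
B = \sum_{i=1}^n H(X_i \mid \bar X_i, Y) = 0 \\
&\Longrightarrow\; 0 \le TC(X) - \bigl(TC(\bar X) + TC(X; Y)\bigr) \le 0 .
\end{align*}
```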

Corollary 2.3. Exact decomposition For $Y$ a function of $X$ and perfect remainder information, $\bar X_i$, $i = 1, \ldots, n$, as defined in Lemma 2.2, the following decomposition holds:

$$TC(X) = TC(\bar X) + TC(X; Y) \qquad (4)$$

The above corollary follows directly from Eq. 3 and the definition of perfect remainder information. Intuitively, it states that the dependence in $X$ can be decomposed into a piece that is explained by $Y$, $TC(X; Y)$, and the remaining dependence in $\bar X$. This decomposition can then be iterated to extract more and more information from the data.

Corollary 2.4. Iterative decomposition Using the variable naming scheme in Fig. 1(b), we construct a hierarchical representation where each $Y_k$ is a function of $X^{k-1}$ and $X^k$ includes the (perfect) remainder information from $X^{k-1}$ according to Lemma 2.2.

$$TC(X) = TC(X^r) + \sum_{k=1}^r TC(X^{k-1}; Y_k) \qquad (5)$$

It is easy to check that Eq. 5 results from repeated application of Cor. 2.3. We show in the next section that the quantities of the form $TC(X^{k-1}; Y_k)$ can be estimated and optimized over efficiently, despite involving high-dimensional variables. As we add the (non-negative) contributions from optimizing $TC(X^{k-1}; Y_k)$, the remaining dependence in the remainder information, $TC(X^k)$, must decrease because $TC(X)$ is some data-dependent constant. Decomposing data into independent factors is exactly the goal of ICA, and the connections are discussed in Sec. 4.

3 Implementing the sieve

Because this learning framework contains many unfamiliar concepts, we consider a detailed analysis of a toy problem in Fig. 2 while addressing concrete issues in implementing the information sieve.

$Y_1 = \arg\max_{Y = f(X)} TC(X; Y)$
$X_1^1 = X_1 + Y_1 \bmod 2$
$X_2^1 = X_2 + Y_1 \bmod 2$
$X_3^1 = X_3$

Figure 2. A simple example for which we imagine we have samples of $X$ drawn from some distribution.

Step 1: Optimizing $TC(X^{k-1}; Y_k)$ First, we construct a variable, $Y_k$, that is some arbitrary function of $X^{k-1}$ and that explains as much of the dependence in the data as possible. Note that we have to pick the cardinality of $Y_k$ and we will always use binary variables. Dropping the layer indices, $k$, the optimization can be written as follows.

$$\max_{p(y|x)} \; \sum_{i=1}^n I(X_i; Y) - I(X; Y) \qquad (6)$$


Here, we have relaxed the optimization to allow for probabilistic functions of $X$. If we take the derivative of this expression (along with the constraint that $p(y|x)$ should be normalized) and set it equal to zero, the following simple fixed point equation emerges.

$$p(y|x) = \frac{p(y)}{Z(x)} \prod_{i=1}^n \frac{p(x_i|y)}{p(x_i)}$$

The state space of $X$ is exponentially large in $n$, the number of variables. Fortunately, this fixed point equation tells us that we can write the solution in terms of a linear number of terms which are just marginal likelihood ratios. Details of this optimization are discussed in Sec. A. Note that the optimization provides a probabilistic function which we round to a deterministic function by taking the most likely value of $Y$ for each $X$. In the example in Fig. 2, $TC(X; Y_1) = 1$ bit, which can be verified from Eq. 2.
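One way to read the fixed point equation is as an iterative update over per-sample soft labels. The following is a minimal sketch of that idea, assuming a random initialization and a fixed iteration count; it is not the authors' implementation (see Sec. A and the released code for that), and `fit_latent_factor`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def fit_latent_factor(x, n_y=2, n_iter=50, seed=0, eps=1e-12):
    """Iterate p(y|x) ∝ p(y) * prod_i p(x_i|y) / p(x_i) over the samples in x.

    x: (n_samples, n_vars) array of small non-negative integers.
    Returns p(y|x) for every sample, shape (n_samples, n_y).
    """
    rng = np.random.default_rng(seed)
    n_samples, _ = x.shape
    n_x = int(x.max()) + 1
    one_hot = np.eye(n_x)[x]                       # (samples, vars, n_x)
    p_x = one_hot.mean(axis=0)                     # marginals p(x_i), (vars, n_x)
    p_y_given_x = rng.dirichlet(np.ones(n_y), size=n_samples)  # random start
    for _ in range(n_iter):
        p_y = p_y_given_x.mean(axis=0)             # (n_y,)
        # Joint p(x_i, y) from the current soft assignments: (vars, n_x, n_y).
        p_xy = np.einsum("svk,sy->vky", one_hot, p_y_given_x) / n_samples
        p_x_given_y = p_xy / (p_y + eps)
        log_ratio = np.log(p_x_given_y + eps) - np.log(p_x[:, :, None] + eps)
        # Sum the log likelihood ratios for the values each sample actually takes.
        per_sample = np.einsum("svk,vky->sy", one_hot, log_ratio)
        log_post = np.log(p_y + eps) + per_sample
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
        p_y_given_x = np.exp(log_post)
        p_y_given_x /= p_y_given_x.sum(axis=1, keepdims=True)  # division by Z(x)
    return p_y_given_x

# Hypothetical data: X_1 = X_2, X_3 independent, repeated to give 100 samples.
x = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]] * 25)
labels = fit_latent_factor(x).argmax(axis=1)       # round to a deterministic Y
```

Each update touches only the per-variable marginals, so the cost per iteration grows linearly with the number of variables rather than with the size of the joint state space. The result can depend on the random start.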


Surprisingly, we did not need to restrict or parametrize the set of possible functions; the simple form of the solution was implied by the objective. Furthermore, we can also use this function to find labels for previously unseen examples or to calculate $Y$’s for data with missing variables (details in Sec. A). Not only that, but a byproduct of the procedure is to give us a value for the objective $TC(X^{k-1}; Y_k)$, which can be estimated even from a small number of samples.


Step 2: Remainder information Next, the goal is to construct the remainder information, $X_i^k$, as a probabilistic function of $X_i^{k-1}, Y_k$, so that the following conditions are satisfied: $I(X_i^k; Y_k) = 0$ and $H(X_i^{k-1} \mid X_i^k, Y_k) = 0$. This can be done exactly and we provide a simple algorithm in Sec. C. Solutions for this example are given in Fig. 2. Concretely, we estimate the marginals, $p(x_i^{k-1}, y_k)$, from data and then write down a conditional probability table, $p(x_i^k \mid x_i^{k-1}, y_k)$, satisfying the conditions. The example in Fig. 2 was constructed so that the remainder information had the same cardinality as the original variables. This is not always possible. While we can always achieve perfect remainder information by letting the cardinality of the remainder information grow, it might become difficult to estimate marginals of the form $p(x_i^{k-1}, y_k)$ at subsequent layers of the sieve, as is required for the optimization in step 1. In results shown below we allow the cardinality of the variables to increase by only one at each level to avoid state space explosion, even if doing so causes $I(X_i^k; Y_k) > 0$. We keep track of these penalty terms so that we can report accurate lower bounds using Eq. 3.


Another issue to note is that in general there may not be a unique choice for the remainder information. In the example, $I(X_3; Y_1) = 0$ already so we choose $X_3^1 = X_3$, but $X_3^1 = X_3 + Y_1 \bmod 2$ would also have been a valid choice. If the identity transformation, $X_i^k = X_i^{k-1}$, satisfies the conditions, we will always choose it.
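The two remainder conditions can be verified numerically for the toy example. The sketch below assumes the data consist of two identical fair bits and one independent fair bit, with $Y_1$ taken to be a copy of the first bit; these data and the helper names are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in entropy estimate in bits."""
    counts = Counter(samples)
    n = len(samples)
    return sum((c / n) * log2(n / c) for c in counts.values())

def mutual_information(a, b):
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]  # samples of X_1, X_2, X_3
y1 = [r[0] for r in rows]                             # assumed latent factor Y_1 = X_1

# Remainder: add Y_1 mod 2 where X_i depends on it, identity where it does not.
remainder = [((a + y) % 2, (b + y) % 2, c) for (a, b, c), y in zip(rows, y1)]

checks = []
for i in range(3):
    r_i = [r[i] for r in remainder]
    x_i = [r[i] for r in rows]
    leak = mutual_information(r_i, y1)                # condition (i): should be 0
    # Condition (ii): H(X_i | remainder_i, Y_1) = H(X_i, r_i, Y_1) - H(r_i, Y_1).
    h_cond = entropy(list(zip(x_i, r_i, y1))) - entropy(list(zip(r_i, y1)))
    checks.append((leak, h_cond))
print(checks)
```

The first two remainder variables become constant, which is why they no longer contribute any dependence at the next layer.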


Step 3: Repeat until the end At this point we repeat the procedure, putting the remainder information back into step 1 and searching for a new latent factor that explains any remaining dependency. In this case, we can see by inspection that $TC(X^1) = 0$ and, using Eq. 5, we have $TC(X) = TC(X^1) + TC(X; Y_1) = 1$ bit. Generally, in high-dimensional spaces it may be difficult to verify that the remainder information is truly independent. When the remainder information is independent, the result of attempting the optimization is $\max_{Y_k = f(X^{k-1})} TC(X^{k-1}; Y_k) = 0$.

In practice, we stop our hierarchical procedure when the optimization in step 1 stops producing positive results because it means our bounds are no longer tightening. Code implementing this entire pipeline is available (Ver Steeg).

Prediction and compression Note that our condition for the remainder information that $H(X_i^{k-1} \mid X_i^k, Y_k) = 0$ implies that we can perfectly reconstruct each variable $X_i^{k-1}$ from the remainder information at the next layer. Therefore, we can in principle reconstruct the data from the representation at the last layer of the sieve. In the example, the remainder information requires two bits to encode each variable separately, while the data requires three bits to encode each variable separately. The final representation has exploited the redundancy between $X_1, X_2$ to create a more succinct encoding. A use case for lossy compression is discussed in Sec. 5. Also note that at each layer some variables are almost or completely explained ($X_1^1, X_2^1$ in the example become constant). Subsequent layers can enjoy a computational speed-up by ignoring these variables that will no longer contribute to the optimization.

4 Discrete ICA

If $X$ represents observed variables then the entropy, $H(X)$, can be interpreted as the average number of bits required to encode a single observation of these variables. In practice, however, if $X$ is high-dimensional then estimating $H(X)$ or constructing this code requires detailed knowledge of $p(x)$, which may require exponentially many samples in the number of variables. Going back at least to Barlow (Barlow, 1989), it was recognized that if $X$ is transformed into some other basis, $Y$, with the $Y$’s independent ($TC(Y) = 0$), then the coding cost in this new basis is $H(Y) = \sum_j H(Y_j)$, i.e., it is the same as encoding each variable separately. This is exactly the problem of independent component analysis: transform the data into a basis for which $TC(Y) = 0$, or is minimized (Comon, 1994; Hyvarinen & Oja, 2000).

While our method does not directly minimize the total correlation of $Y$, Eq. 5 shows that, because $TC(X)$ is a data-dependent constant, every increase in the total correlation explained by each latent factor directly implies a reduction in the dependence of the resulting representation.

$$TC(X^r) = TC(X) - \sum_{k=1}^r TC(X^{k-1}; Y_k)$$

Since the terms in the sum are optimized (and always non-negative), the dependence is decreased at each level. That independence could be achieved as a byproduct of efficient coding has been previously considered (Hochreiter & Schmidhuber, 1999). An approach leading to “less dependent components” for continuous variables has also been shown (Stogbauer et al., 2004).

Figure 3. On the far left, we consider three independent binary random variables, $S_1, S_2, S_3$. The vertical position of each signal is offset for visibility. From left to right: Independent source data is linearly mixed, $x = As$. This data, $X$, is fed into the information sieve. After going through some intermediate representations, the final result is shown on the far right. Consult Fig. 1(b) for the variable naming scheme. The final representation recovers the independent input sources.

For discrete variables, which are the focus of this paper, performing ICA is a challenging and active area of research. Recent state-of-the-art results lower the complexity of this problem to only a single exponential in the number of variables (Painsky et al., 2014). Our method represents a major leap for this problem as it is only linear in the number of variables. However, we only guarantee extraction of components that are more independent, while the approach of Painsky et al. guarantees a global optimum.

The most commonly studied scenario for ICA is to consider a reconstruction problem where some (typically continuous) and independent source variables are linearly mixed according to some unknown matrix (Comon, 1994; Hyvarinen & Oja, 2000). The goal is to recover the matrix and unmix the components (back into their independent sources). Next we demonstrate our discrete independent component recovery on an example reminiscent of traditional ICA examples.

An ICA example Fig. 3 shows an example of recovering independent components from discrete random variables. The sources, $S$, are hidden and the observations, $X$, are a linear mixture of these sources. The mixing matrix used in this example is

A = ((1,1,1),(2,0,-1),(1,2,0),(-1,1,0)).
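Data of this kind can be generated in a few lines. The sketch below assumes fair binary sources, 1000 samples, and a fixed seed, none of which are specified in the text; the relabelling step is also an assumption, added because discrete estimators are usually easiest to write for symbols numbered from zero.

```python
import numpy as np

rng = np.random.default_rng(0)              # seed is an arbitrary choice
A = np.array([[1, 1, 1],
              [2, 0, -1],
              [1, 2, 0],
              [-1, 1, 0]])                  # mixing matrix from the text, 4 x 3
s = rng.integers(0, 2, size=(3, 1000))      # three independent binary sources (assumed fair)
x = A @ s                                   # observations x = A s, shape (4, 1000)

# Relabel each observed variable to symbols 0..K-1 (an assumption of this sketch).
x_relabelled = np.stack([np.unique(row, return_inverse=True)[1] for row in x])
```

Each observed variable takes only a handful of integer values, so the mixed data remain discrete even though the mixing is linear.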

The information sieve continues to add layers as long as it increases the tightness of the information bounds. The intermediate representations at each layer are also shown. For instance, layer 1 extracts one independent component, and then removes this component from the remainder information. After three layers, the sieve stops because $X^3$ consists of independent variables and therefore the optimization yields $\max_{Y_4} TC(X^3; Y_4) = 0$.

In this case, the procedure correctly stops after three latent factors are discovered. Naively, three layers make this a “deep” representation. However, we can examine the functional dependence of $Y$’s and $X$’s by looking at the strength of the mutual information, $I(Y_k; X_i^{k-1})$, as shown in Fig. 4. This allows us to see that none of the learned latent factors ($Y$’s) depend on each other, so the resulting model is actually, in some sense, shallow. The example in the next section, for contrast, has a deep structure where $Y$’s depend on latent factors from previous layers. Note that the structure in Fig. 4 perfectly reflects the structure of the mixing matrix (i.e., if we flipped the arrows and changed the $Y$’s to $S$’s, this would be an accurate representation of the generative model we used).



Figure 4. This visualizes the structure of the learned representation for the ICA example in Fig. 3. The thickness of links is proportional to $I(Y_k; X_i^{k-1})$.


While the sieve is guaranteed to recover independent components in some limit, there may be multiple ways to decompose the data into independent components. Because our method does not start with the assumption of a linear mixing of independent sources, even if such a decomposition exists we might recover a different one. While the example we showed happened to return the linear solution that we used to generate the problem, there is no guarantee of finding a linear solution, even if one exists.

5 Lossy compression on MNIST digits

The information sieve is not a generative probabilistic model. We construct latent factors that are functions of the data in a way that maximizes the (multivariate) information that is preserved. Nevertheless, because of the way the remainder information is constructed, we can run the sieve in reverse and, if we throw away the remainder information and keep only the $Y$’s, we get a lossy compression. We can use this lossy compression interpretation to perform tasks that are usually achieved using generative models including in-painting and generating new examples (the converse, interpreting a generative model as lossy compression, has also been considered (Hinton & Salakhutdinov, 2006)).

We illustrate the steps for lossy compression and in-painting in Fig. 5. Imagine that we have already trained a sieve model. For lossy compression, we first transform some data using the sieve. The sieve is an invertible transformation, so we can run it in reverse to exactly recover the inputs. Instead we store only the labels, $Y$, throwing the remainder information away. When we invert the sieve, what values should we input for the remainder information, $X^r_{1:n}$? During training, we estimate the most likely value to occur for each remainder variable, $X_i^r$. W.l.o.g., we relabel the symbols so that this value is 0. Then, for lossy recovery, we run the sieve in reverse using the labels we stored, $Y$, and setting the remainder information to 0.

In-painting proceeds in essentially the same way. We take advantage of the fact that we can transform data even in the presence of missing values, as described in Sec. A. Then we replace missing values in the remainder information with 0’s and invert the sieve normally.


Figure 5. (a) We use the sieve to transform data into some labels, $Y$, plus remainder information. For lossy recovery, we invert the sieve using only the $Y$’s, setting the remainder information to zero. (b) For in-painting, we first transform data with missing values. Then we invert the sieve, again using zeros for the missing remainder information.

For the following tasks, we consider 50k MNIST digits that were binarized at the normalized grayscale threshold of 0.5. We include no prior knowledge about spatial structure or invariance under transformations through convolutional structure or pooling, for instance. The 28x28 binarized images are treated as binary vectors in a 784 dimensional space. The digit labels are also not used in our analysis. We trained the information sieve on this data, adding layers as long as the bounds were tightening. This led to a 12 layer representation and a lower bound on $TC(X)$ of about 40 bits. It seems likely that more than 12 layers could be effective but the growing size of the state space for the remainder information increases the difficulty of estimation with limited data. A visualization of the learned latent factors and the relationships among them appears in Fig. 6. Unlike the ICA example, the latent factors here exhibit multi-layered relationships.

Figure 6. (a) We visualize each of the learned components, arranged in reading order. (b) The structural relationships among the latent factors are based on $I(Y_k; X_i^{k-1})$. The size of a node represents the magnitude of $TC(X^{k-1}; Y_k)$.

The middle row of Fig. 7 shows results from the lossy compression task. We use the sieve to transform the original digits into 12 binary latent factors, $Y$, plus remainder information for each pixel, $X^{12}_{1:784}$, and then we use the $Y$’s alone to reconstruct the image. In the third row, the $Y$’s are estimated using only pixels from the top half. Then we reconstruct the pixels on the bottom half from these latent factors. Similar results on test images are shown in Sec. D, along with examples of “hallucinating” new digits.


6 Lossless compression 

Given samples of $X$ drawn from $p(x)$, the best compression we can do in theory is to use an average of $H(X)$ bits for our compressed representation (Shannon, 1948). However, in practice, if $X$ is high-dimensional then we cannot accurately estimate $p(x)$ to achieve this level of compression. We consider alternate schemes and compare them to the information sieve for a compression task.

The Information Sieve

Figure 7. The top row shows randomly selected MNIST digits. In the second row, we compress the digits into the 12 binary variables, $Y_k$, and then attempt to reconstruct the image. In the bottom row, we learn $Y$’s using just the pixels in the top half and then recover the pixels in the bottom half.

Benchmark For a lossless compression benchmark, we consider a set of 60k binarized digits with 784 pixels, where the order of the pixels has been randomly permuted (the same unknown permutation is applied to each image). Note that we have made this task artificially more difficult than the straightforward task of compressing digits because many compression schemes exploit spatial correlations among neighboring pixels for efficiency. The information sieve is unaffected by this permutation since it does not make any assumptions about the structure of the input space (e.g. the adjacency of pixels). We use 50k digits as training for models, and report compression results on the 10k test digits.

Naively, these 28 by 28 binary pixels would require 784 bits per digit to store. However, some pixels are almost always zero. According to Shannon, we can compress pixel $i$ using just $H(X_i)$ bits on average (Shannon, 1948). Because the state space of each individual bit is small, this bound is actually achievable (using arithmetic coding (Cover & Thomas, 2006), for example). Therefore, we should be able to store the digits using $\sum_{i=1}^{784} H(X_i) \approx 297$ bits/digit on average.
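The bitwise bound is simply a sum of binary entropies, one per pixel. A minimal sketch, assuming the images are stored as a 0/1 array with one row per image; the function name and the tiny demo array are hypothetical stand-ins for the binarized digits.

```python
import numpy as np

def bitwise_bits(data):
    """Sum over pixels of the binary entropy H(X_i), in bits per image.

    data: (n_images, n_pixels) array of 0/1 values.
    """
    p = data.mean(axis=0)                   # p(x_i = 1) for each pixel
    p = p[(p > 0) & (p < 1)]                # constant pixels cost 0 bits
    return float(np.sum(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))

# Hypothetical stand-in data: pixel 0 is a fair coin, pixel 1 is always 0.
demo = np.array([[0, 0], [1, 0], [0, 0], [1, 0]])
print(bitwise_bits(demo))
```

Pixels that never change contribute nothing to the total, which is how the bound drops well below the naive one bit per pixel.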

We would like to make the data more compressible by first transforming it. We consider a simplified version of the sieve with just one layer. We let $Y$ take $m$ possible values and then optimize it according to our objective. For the remainder information, we use the (invertible) function $\bar X_i = |x_i - \arg\max_z p(X_i = z \mid Y = y)|$. In other words, $\bar X_i$ represents deviation from the most likely value of $X_i$ for a given value of $Y$. The cost of storing a digit in this new representation will be $\log_2 m + \sum_{i=1}^{784} H(\bar X_i)$, where $\log_2 m$ bits are used to store the value of $Y$.
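The remainder function and the storage cost for the single-layer case can be sketched as follows. This assumes the labels for $Y$ have already been learned and are supplied as integers; `sieve_cost`, `binary_entropy_sum`, and the four-image demo are hypothetical names and data chosen for illustration.

```python
import numpy as np

def binary_entropy_sum(bits):
    """Sum of per-column binary entropies, in bits."""
    p = bits.mean(axis=0)
    p = p[(p > 0) & (p < 1)]
    return float(np.sum(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))

def sieve_cost(data, labels, m):
    """Bits per image: log2(m) for Y plus sum_i H(remainder_i).

    data: (n, d) array of 0/1; labels: (n,) ints in 0..m-1 (Y assumed already learned).
    """
    data = np.asarray(data, dtype=int)      # avoid unsigned wrap-around on subtraction
    mode = np.zeros((m, data.shape[1]), dtype=int)
    for y in range(m):
        members = data[labels == y]
        if len(members):
            mode[y] = members.mean(axis=0) > 0.5   # argmax_z p(X_i = z | Y = y)
    remainder = np.abs(data - mode[labels])        # deviation from the most likely value
    return np.log2(m) + binary_entropy_sum(remainder), remainder, mode

# Hypothetical demo: two perfectly separated clusters leave no residual error.
demo = np.array([[0, 0], [0, 0], [1, 1], [1, 1]])
demo_labels = np.array([0, 0, 1, 1])
cost, remainder, mode = sieve_cost(demo, demo_labels, m=2)
restored = np.abs(mode[demo_labels] - remainder)   # inverting the remainder function
```

For binary pixels the same absolute-difference operation inverts the transformation, so the original image is recovered from the label, the per-label modes, and the remainder.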

For comparison, we consider an analogous benchmark introduced in (Gregor & LeCun, 2011). For this benchmark, we just choose $m$ random digits as representatives (from the training set). Then for each test digit, we store the identity of the closest representative (by Hamming distance), along with the error, which we will also call $\bar X_i$, so that we can recover the original digit. Again, the number of bits per digit will just be $\log_2 m$ plus the cost of storing the errors for each pixel according to Shannon.
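The random-representative benchmark can be sketched in the same style. The function name, seed, and demo data below are assumptions; the all-pairs distance computation is written for clarity and would need to be chunked for the full 10k test set.

```python
import numpy as np

def random_representative_cost(train, test, m, seed=0):
    """Bits per image: log2(m) for the representative id plus per-pixel error entropy."""
    rng = np.random.default_rng(seed)
    reps = train[rng.choice(len(train), size=m, replace=False)]
    # Hamming distance from every test image to every representative.
    dist = (test[:, None, :] != reps[None, :, :]).sum(axis=2)
    nearest = dist.argmin(axis=1)
    errors = (test != reps[nearest]).astype(int)    # per-pixel error bits
    p = errors.mean(axis=0)
    p = p[(p > 0) & (p < 1)]
    error_bits = float(np.sum(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))
    return np.log2(m) + error_bits, nearest, errors

# Hypothetical demo: with every image available as a representative, errors vanish.
imgs = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]])
cost, nearest, errors = random_representative_cost(imgs, imgs, m=4)
```

The only difference from the sieve variant is how the representatives are chosen: here they are drawn at random rather than optimized against the information objective.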

Figure 8. This shows $p(x_i = 1 \mid y = k)$ for $k = 1, \ldots, 20$ for each pixel, $x_i$, in an image.

Consider the single layer sieve with $Y = 1, \ldots, m$ and $m = 20$. After optimizing, Fig. 8 visualizes the components of $Y$. As an exercise in unsupervised clustering the results are somewhat interesting; the sieve basically finds clusters for each digit and for slanted versions of each digit. In Fig. 9 we explicitly construct the remainder information (bottom row), i.e. the deviation between the most likely value of each pixel conditioned on $Y$ (middle row) and the original (top row).

Figure 9. The top row shows the original digit, the middle row shows the most likely values of the pixels conditioned on the label, $y = 1, \ldots, 20$, and the bottom row shows the remainder or residual error, $\bar X$.

The results of our various compression benchmarks are shown in Table 1. For comparison we also show results from two standard compression schemes, gzip, based on Lempel-Ziv coding (Ziv & Lempel, 1977), and Huffman coding (Huffman et al., 1952). We take the better compression result from storing and compressing the 784 x 50000 data array in column-major or row-major order with these (sequence-based) compression schemes. Note that the sieve and random representative benchmark that we described require a codebook of fixed size whose contribution is asymptotically negligible and is not included in the results.

Table 1. Summary of compression results. Results with a * are reported based on empirical compression results rather than Shannon bounds.

Method                              Bits per digit
Naive                               784
Huffman* (Huffman et al., 1952)     376
gzip* (Ziv & Lempel, 1977)          328
Bitwise                             297
20 random representatives           293
50 random representatives           279
100 random representatives          267
20 sieve representatives            266
50 sieve representatives            252
100 sieve representatives           243

Discussion. First of all, sequence-based compression
schemes have a serious disadvantage in this setup. Because
the pixels are scrambled, taking advantage of correlations
would require longer window sizes than is typical. The
random representative scheme does significantly better. Despite
the scrambled pixels, at least it uses the fact that the
data consist of iid samples of length 784 pixels. However,
the sieve leads to much better compression; for instance,
20 sieve representatives are as good as 100 random ones
(266 versus 267 bits per digit in Table 1).
The idea behind “factorial codes” (Barlow, 1989) is that
if we can transform our data so that the variables are independent,
and then (optimally) compress each variable separately,
we will achieve a globally optimal compression. The
compression results shown here are promising, but are not
state-of-the-art. The reason is that our discovery of discrete
independent components comes at a cost of increasing the
cardinality of variables at each layer of the sieve. To define
a more practical compression scheme, we would have
to balance the trade-off between reducing dependence and
controlling the size of the state space. We leave this direction
for future work.
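The factorial-code argument can be checked numerically on toy data: coding each variable separately costs $\sum_i H(X_i)$ bits per sample, the joint optimum is $H(X)$, and the gap between them is exactly the total correlation $TC(X)$. The sketch below uses plug-in entropy estimates and hypothetical helper names (`entropy_bits`, `coding_costs`); the empirical joint entropy is only meaningful when the number of samples is large relative to the number of joint states.

```python
import numpy as np
from collections import Counter

def entropy_bits(samples):
    """Plug-in entropy estimate (bits) of a sequence of hashable symbols."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def coding_costs(data):
    """Return (sum of marginal entropies, joint entropy, their gap TC(X))."""
    data = np.asarray(data)
    # Cost of compressing every column on its own.
    separate = sum(entropy_bits(col.tolist()) for col in data.T)
    # Cost of an optimal code for whole rows.
    joint = entropy_bits([tuple(row) for row in data.tolist()])
    return separate, joint, separate - joint
```

For two perfectly correlated bits the separate cost is 2 bits against a joint cost of 1 bit, so one bit per sample is wasted; for independent bits the gap vanishes, which is the situation a factorial code aims to reach.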

7 Related work 

The idea of decomposing multivariate information as an
underlying principle for unsupervised representation learning
has been recently introduced (Ver Steeg & Galstyan,
2015; 2014) and used in several contexts (Pepke &
Ver Steeg, 2016; Madsen et al., 2016). While bounds on
$TC(X)$ were previously given, here we provided an exact
decomposition. Our decomposition also introduces the
idea of remainder information. While previous work required
fixing the depth and number of latent factors in the
representation, remainder information allows us to build
up the representation incrementally, learning the depth and
number of factors required as we go. Besides providing
a more flexible approach to representation and structure
learning, the invertibility of the information sieve makes
it more naturally suited to a wider variety of tasks including
lossy and lossless compression and prediction. Another
interesting related result showed that positivity of the quantity
$TC(X;Y)$ (the same quantity appearing in our bounds)
implies that the $X_i$'s share a common ancestor in any DAG
consistent with $p_X(x)$ (Steudel & Ay, 2015). A different
line of work about information decomposition focuses on
distinguishing synergy and redundancy (Williams & Beer,
2010), though these measures are typically impossible to
estimate for high-dimensional systems. Finally, a different
approach to information decomposition focuses on the geometry
of the manifold of distributions defined by different
models (Amari, 2001).

Connections with ICA were discussed in Sec. 4 and the
relationship to InfoMax was discussed in Sec. 1. The information
bottleneck (IB) (Tishby et al., 2000) is another
information-theoretic optimization for constructing representations
of data that has many mathematical similarities
to the objective in Eq. 6, with the main difference being
that IB focuses on supervised learning while ours is an unsupervised
approach. Recently, the IB principle was used
to investigate the value of depth in the context of supervised
learning (Tishby & Zaslavsky, 2015). The focus here, on
the other hand, is to find an information-theoretic principle
that justifies and motivates deep representations for unsupervised
learning.

8 Conclusion 

We introduced the information sieve, which provides
a decomposition of multivariate information for high-dimensional
(discrete) data that is also computationally
feasible. The extension of the sieve to continuous variables
is nontrivial but appears to result in algorithms that
are more robust and practical (Ver Steeg et al., 2016). We
established here a few of the immediate implications of the
sieve decomposition. First of all, we saw that a natural notion
of “remainder information” arises and that this allows
us to extract information in an incremental way. Several
distinct applications to fundamental problems in unsupervised
learning were demonstrated and appear promising for
in-depth exploration. The sieve provides an exponentially
faster method than the best known algorithm for discrete
ICA (though without guarantees of global optimality). We
also showed that the sieve defines both lossy and lossless
compression schemes. Finally, the information sieve suggests
a novel conceptual framework for understanding unsupervised
representation learning. Among the many deviations
from standard representation learning a few properties
stand out. Representations are learned incrementally
and the depth and structure emerge in a data-driven way.
Representations can be evaluated information-theoretically
and the decomposition allows us to separately characterize
the contribution of each hidden unit in the representation.
Supplementary Material for “The Information Sieve”

A Detail of the optimization of $TC(X;Y)$

We need to optimize the following objective.

$$\max_{p(y|x)} \; \sum_{i=1}^{n} I(X_i; Y) - I(X; Y)$$

If we take the derivative of this expression (along with
the constraint that $p(y|x)$ should be normalized) and set
it equal to zero, the following simple fixed point equation
emerges.

$$p(y|x) = \frac{p(y)}{Z(x)} \prod_{i=1}^{n} \frac{p(x_i|y)}{p(x_i)}$$


Surprisingly, optimizing this objective over possible functions
has a fixed point solution with a simple form. This
leads to an iterative solution procedure that actually corresponds
to a special case of the one considered in (Ver Steeg
& Galstyan, 2015). There it is shown that each iterative update
of the fixed-point equation increases the objective and
that we are therefore guaranteed to converge to a local optimum
of the objective. In short, we consider the empirical
distribution over observed samples. For each sample, we
start with a random probabilistic label. Then we use these
labels to estimate the marginals, $p(x_i|y)$, then we use the
fixed point to re-estimate $p(y|x)$, and so on until convergence.

Also, note that we can estimate the value of the objective in
a simple way. The normalization term, $Z(x)$, is computed
for each sample by just summing over the two values of
$Y = y$, since $Y$ is binary. The expected logarithm of $Z$, or
the free energy, is an estimate of the objective (Ver Steeg
& Galstyan, 2015).


Algorithmic details. The code implementing this optimization
is included as a module in the sieve code (Ver
Steeg). The algorithm is described in Alg. 1. Note we
use $\delta$ as the discrete delta function. The complexity is
$O(k \times N \times n)$, where $n$ is the number of variables, $N$
is the number of samples, and $k$ is the cardinality of the
latent factor, $Y$. Because the solution only depends on estimation
of marginals between $X_i$ and $Y$, the number of
samples needed for accurate estimation is small (Ver Steeg
& Galstyan, 2015).


Labeling test data. The fixed point equation above essentially
gives us a simple representation of the labeling function
in terms of some parameters which, in this case, just
correspond to the marginal probability distributions. We
simply input values of $x$ from a test set into that equation,
and then round $y$ to the most likely value to generate labels.

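Algorithm 1 Optimizing $TC(X;Y)$

Input: Data matrix, $x_i^l$,
    $i = 1, \ldots, n$ variables, $l = 1, \ldots, N$ samples.
Specify $k$: Cardinality of $Y = 1, \ldots, k$

Randomly initialize $p(Y = y \mid X = x^l)$
repeat
    $p(Y = y) = \frac{1}{N} \sum_l p(Y = y \mid X = x^l)$
    for $i = 1$ to $n$ do
        $p(X_i = x_i \mid Y = y) = \frac{1}{N} \sum_l p(Y = y \mid X = x^l) \, \delta_{x_i, x_i^l} \, / \, p(Y = y)$
    end for
    $p(Y = y \mid X = x^l) = \frac{p(Y = y)}{Z(x^l)} \prod_{i=1}^{n} \frac{p(X_i = x_i^l \mid Y = y)}{p(X_i = x_i^l)}$
until Convergence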
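The update loop of Alg. 1 can be sketched in NumPy as follows. This is a minimal illustration, not the released sieve code: the function name `optimize_tc`, the Dirichlet initialization, the smoothing constant `eps`, and the stopping rule on the change in $E[\log Z]$ are all assumptions made for the example, and the data are assumed to be small non-negative integers.

```python
import numpy as np

def optimize_tc(x, k=2, n_iter=100, tol=1e-6, seed=0, eps=1e-12):
    """Fixed-point iteration for p(y|x) on an integer data matrix x of shape (N, n)."""
    x = np.asarray(x, dtype=int)
    N, n = x.shape
    cols = np.arange(n)
    delta = np.eye(int(x.max()) + 1)[x]           # one-hot data, shape (N, n, V)
    p_x = delta.mean(axis=0)                      # marginals p(X_i = v), shape (n, V)
    log_p_x = np.log(p_x[cols, x] + eps)          # log p(x_i^l), shape (N, n)
    rng = np.random.default_rng(seed)
    p_y_x = rng.dirichlet(np.ones(k), size=N)     # random soft labels, shape (N, k)
    prev = -np.inf
    for _ in range(n_iter):
        p_y = p_y_x.mean(axis=0)                  # p(Y = y), shape (k,)
        # p(X_i = v | Y = y) = (1/N) sum_l p(y|x^l) delta / p(y), shape (k, n, V)
        p_xi_y = np.einsum("lk,liv->kiv", p_y_x, delta) / (N * p_y[:, None, None] + eps)
        # log of p(x_i^l | y) / p(x_i^l) for every label, sample, variable: (k, N, n)
        log_ratio = np.log(p_xi_y[:, cols, x] + eps) - log_p_x
        log_unnorm = np.log(p_y + eps) + log_ratio.sum(axis=2).T   # shape (N, k)
        log_z = np.logaddexp.reduce(log_unnorm, axis=1)            # log Z(x^l)
        p_y_x = np.exp(log_unnorm - log_z[:, None])                # normalized update
        objective = float(log_z.mean())           # E[log Z], in nats
        if abs(objective - prev) < tol:
            break
        prev = objective
    return p_y_x, p_y, p_xi_y, p_x, objective
```

Each pass touches every (label, sample, variable) triple once, matching the stated $O(k \times N \times n)$ cost up to the number of values each variable can take. Because the procedure only reaches a local optimum, results can depend on the random initialization.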


Missing data. Note that missing data is handled quite
gracefully in this scenario. Imagine that only some subset of the
$X_i$'s is observed. Denote the subset of indices for which
we have observed data on a given sample with $G$ and the
subset of random variables as $x_G$. If we solved the optimization
problem for this subset only, we would get a form
for the solution like this:

$$p(y|x_G) = \frac{p(y)}{Z(x)} \prod_{i \in G} \frac{p(x_i|y)}{p(x_i)}$$

In other words, we simply omit the contribution from unobserved
variables in the product.
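Labeling new samples, with or without missing entries, only needs the learned marginals. The sketch below is illustrative: the function name `label_with_missing`, the boolean `observed` mask, and the array layouts (`p_xi_y` of shape `(k, n, V)`, `p_x` of shape `(n, V)`) are assumptions chosen for the example rather than the interface of the released code.

```python
import numpy as np

def label_with_missing(x, observed, p_y, p_xi_y, p_x, eps=1e-12):
    """Evaluate p(y | x_G), restricting the product to observed entries."""
    x = np.asarray(x, dtype=int)
    observed = np.asarray(observed, dtype=bool)
    p_y, p_xi_y, p_x = map(np.asarray, (p_y, p_xi_y, p_x))
    cols = np.arange(x.shape[1])
    idx = np.where(observed, x, 0)      # placeholder value where data is missing
    # log of p(x_i | y) / p(x_i) for every label, sample, variable: shape (k, N, n)
    log_ratio = np.log(p_xi_y[:, cols, idx] + eps) - np.log(p_x[cols, idx] + eps)
    # Unobserved variables contribute a factor of 1, i.e. 0 in log space.
    log_ratio = np.where(observed[None], log_ratio, 0.0)
    log_unnorm = np.log(p_y + eps) + log_ratio.sum(axis=2).T       # shape (N, k)
    log_z = np.logaddexp.reduce(log_unnorm, axis=1, keepdims=True)
    posterior = np.exp(log_unnorm - log_z)
    return posterior, posterior.argmax(axis=1)   # soft labels and rounded labels
```

With a mask that is true everywhere this reduces to the ordinary labeling rule for test data; with nothing observed the posterior falls back to the prior $p(y)$.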


B Proof of Theorem 2.1 

We begin by adopting a general definition for “representations”
and recalling a useful theorem concerning them.

Definition. The random variables $Y \equiv Y_1, \ldots, Y_m$ constitute
a representation of $X$ if the joint distribution factorizes,
$p(x, y) = \prod_{j=1}^{m} p(y_j|x)\, p(x)$, $\forall x \in \mathcal{X}, \forall j \in
\{1, \ldots, m\}, \forall y_j \in \mathcal{Y}_j$. A representation is completely defined
by the domains of the variables and the conditional
probability tables, $p(y_j|x)$.

Theorem B.1. Basic Decomposition of Information (Ver
Steeg & Galstyan, 2015)

If $Y$ is a representation of $X$ and we define,

$$TC_L(X; Y) \equiv \sum_{i=1}^{n} I(Y; X_i) - \sum_{j=1}^{m} I(Y_j; X), \qquad (7)$$

then the following bound and decomposition holds.

$$TC(X) \ge TC(X; Y) = TC(Y) + TC_L(X; Y) \qquad (8)$$

Theorem. Incremental Decomposition of Information

Let $Y$ be some (deterministic) function of $X_1, \ldots, X_n$ and
for each $i = 1, \ldots, n$, $\bar{X}_i$ is a probabilistic function of
$X_i, Y$. Then the following upper and lower bounds on
$TC(X)$ hold.

$$-\sum_{i=1}^{n} I(\bar{X}_i; Y) \;\le\; TC(X) - \left(TC(\bar{X}) + TC(X; Y)\right) \;\le\; \sum_{i=1}^{n} H(X_i \mid \bar{X}_i, Y) \qquad (9)$$

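Proof. We refer to Fig. 1(a) for the structure of the graphical
model. We set $\bar{X} \equiv \bar{X}_1, \ldots, \bar{X}_n, Y$ and we will write
$\bar{X}_{1:n}$ to pick out all terms except $Y$. Note that because
$Y$ is a deterministic function of $X$, we can view $\bar{X}_i$ as
a probabilistic function of $X_i, Y$ or of $X$ (as required by
Thm. B.1). Applying Thm. B.1, we have

$$TC(X; \bar{X}) = TC(\bar{X}) + TC_L(X; \bar{X}).$$

On the LHS, note that $TC(X; \bar{X}) = TC(X) - TC(X \mid \bar{X})$,
so we can re-arrange to get

$$TC(X) - \left(TC(\bar{X}) + TC(X; Y)\right) = TC(X \mid \bar{X}) + TC_L(X; \bar{X}) - TC(X; Y). \qquad (10)$$

The LHS is the quantity we are trying to bound, so we focus
on expanding the RHS and bounding it.

First we expand $TC_L(X; \bar{X}) = \sum_{i=1}^{n} I(\bar{X}; X_i) - \sum_{i=1}^{n} I(\bar{X}_i; X) - I(Y; X)$.
Using the chain rule for mutual information we expand the first term.

$$TC_L(X; \bar{X}) = \sum_{i=1}^{n} I(X_i; Y) + \sum_{i=1}^{n} I(X_i; \bar{X}_{1:n} \mid Y) - \sum_{i=1}^{n} I(\bar{X}_i; X) - I(Y; X).$$

Rearranging, we take out a term equal to $TC(X; Y)$.

$$TC_L(X; \bar{X}) = TC(X; Y) + \sum_{i=1}^{n} I(X_i; \bar{X}_{1:n} \mid Y) - \sum_{i=1}^{n} I(\bar{X}_i; X).$$

We use the chain rule again to write $I(X_i; \bar{X}_{1:n} \mid Y) =
I(X_i; \bar{X}_i \mid Y) + I(X_i; \bar{X}_{\tilde{i}} \mid Y \bar{X}_i)$, where $\bar{X}_{\tilde{i}} \equiv \bar{X}_1, \ldots, \bar{X}_n$
with $\bar{X}_i$ (and $Y$) excluded.

$$TC_L(X; \bar{X}) = TC(X; Y) + \sum_{i=1}^{n} \left( I(X_i; \bar{X}_i \mid Y) + I(X_i; \bar{X}_{\tilde{i}} \mid Y \bar{X}_i) - I(\bar{X}_i; X) \right).$$

The conditional mutual information satisfies $I(A; B \mid C) =
I(A; BC) - I(A; C)$. We expand the first instance of CMI
in the previous expression.

$$TC_L(X; \bar{X}) = TC(X; Y) + \sum_{i=1}^{n} \left( I(\bar{X}_i; X_i, Y) - I(\bar{X}_i; Y) + I(X_i; \bar{X}_{\tilde{i}} \mid Y \bar{X}_i) - I(\bar{X}_i; X) \right).$$

Since $Y = f(X)$ and $\bar{X}_i$ depends on $X$ only through $X_i, Y$,
the first and fourth terms cancel. Finally, this leaves us with

$$TC_L(X; \bar{X}) = TC(X; Y) - \sum_{i=1}^{n} I(\bar{X}_i; Y) + \sum_{i=1}^{n} I(X_i; \bar{X}_{\tilde{i}} \mid Y \bar{X}_i).$$

Now we can replace all of this back in to Eq. 10, noting that
the $TC(X; Y)$ terms cancel.

$$TC(X) - \left(TC(\bar{X}) + TC(X; Y)\right) = TC(X \mid \bar{X}) - \sum_{i=1}^{n} I(\bar{X}_i; Y) + \sum_{i=1}^{n} I(X_i; \bar{X}_{\tilde{i}} \mid Y \bar{X}_i).$$

First, note that total correlation, conditional total correlation,
mutual information, conditional mutual information,
and entropy (for discrete variables) are non-negative.
Therefore we trivially have the lower bound, LHS $\ge
-\sum_{i=1}^{n} I(\bar{X}_i; Y)$. All that remains is to find the upper
bound. We drop the negative mutual information, expand
the definition of TC in the first line, then drop the negative
of an entropy in the second line.

$$\begin{aligned}
\text{LHS} &\le \sum_{i=1}^{n} H(X_i \mid \bar{X}) - H(X \mid \bar{X}) + \sum_{i=1}^{n} I(X_i; \bar{X}_{\tilde{i}} \mid Y \bar{X}_i) \\
&\le \sum_{i=1}^{n} \left( H(X_i \mid \bar{X}) + I(X_i; \bar{X}_{\tilde{i}} \mid Y \bar{X}_i) \right) \\
&= \sum_{i=1}^{n} H(X_i \mid \bar{X}_i, Y)
\end{aligned}$$

The equality in the last line can be seen by just expanding
all the definitions of conditional entropies and conditional
mutual information. These provide the upper and lower
bounds for the theorem. □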

C An algorithm for perfect reconstruction 
of remainder information 

We will use the notation of Fig. 1(a) to construct remainder
information for one variable in one layer of the sieve.
The goal is to construct the remainder information, $\bar{X}_i$, as
a probabilistic function of $X_i, Y$ so that we satisfy the
conditions of Lemma 2.2,

(i) $I(\bar{X}_i; Y) = 0$    (ii) $H(X_i \mid \bar{X}_i, Y) = 0$.

Figure B.1. An illustration of how the remainder information, $\bar{X}_i$,
is constructed from statistics about $p(x_i, y)$.

Figure B.2. The same results as Fig. 7 but using samples from a test set instead of the training set.

We need to write down a probabilistic function $p(\bar{x}_i \mid x_i, y)$
so that, for the observed statistics, $p(x_i, y)$, these conditions
are satisfied. There are many ways to accomplish this,
and we sketch out one solution here. The actual code we
use to generate remainder information for results in this paper
is available (Ver Steeg).

We start with the picture in Fig. B.1 that visualizes the conditional
probabilities $p(x_i \mid y)$. Note that the order of the $x_i$
for each value of $y$ can be arbitrary for this scheme to succeed.
For concreteness, we sort the values of $x_i$ for each
$y$ in order of descending likelihood. Next, we construct
the marginal distribution, $p(\bar{x}_i)$. Every time we see a split
in one of the histograms of $p(x_i \mid y)$, we introduce a corresponding
split for $p(\bar{x}_i)$. Now, to construct $p(\bar{x}_i \mid x_i, y)$,
for each $\bar{x}_i = q$, for each $y = j$, we find the unique
value of $x_i = k(j, q)$ that is directly above the histogram
for $p(\bar{x}_i = q)$. Then we set $p(\bar{x}_i = q \mid x_i, y) = p(\bar{x}_i =
q) / p(x_i = k(j, q) \mid y = j)$ for $x_i = k(j, q)$ and $y = j$. Now, marginalizing over $x_i$,
$p(\bar{x}_i \mid y) = p(\bar{x}_i)$, ensuring that $I(\bar{X}_i; Y) = 0$. Visually, it
can be seen that $H(X_i \mid \bar{X}_i, Y) = 0$ by picking a value of
$\bar{x}_i$ and $y$ and noting that it picks out a unique value of $x_i$ in
Fig. B.1.

Note that the function to construct $\bar{X}_i$ is probabilistic.
Therefore, when we construct the remainder information
at the next layer of the sieve, we have to draw $\bar{X}_i$ stochastically
from this distribution. In the example in Sec. 3 the
functions for the remainder information happened to be deterministic.
In general, though, probabilistic functions inject
some noise to ensure that correlations with $Y$ are forgotten
at the next level of the sieve. In Sec. 6 we point out
that this scheme is detrimental for lossless compression and
we point out an alternative.

Controlling the cardinality of $\bar{X}_i$. It is easy to imagine
scenarios in Fig. B.1 where the cardinality of $\bar{X}_i$ becomes
very large. What we would like is to be able to approximately
satisfy conditions (i) and (ii) while keeping the
cardinality of the variables, $\bar{X}_i$, small (so that we can accurately
estimate probabilities from samples of data). To
guide intuition, consider two extreme cases. First, imagine
setting $\bar{X}_i = 0$, regardless of $X_i, Y$. This satisfies
condition (i) but maximally violates (ii). The other extreme
is to set $\bar{X}_i = X_i$. In that case, (ii) is satisfied, but
$I(\bar{X}_i; Y) = I(X_i; Y)$. This is only problematic if $X_i$ is related
to $Y$ to begin with. If it is, and we set $\bar{X}_i = X_i$, then
the same dependence can be extracted at the next layer as
well (since we pass $X_i$ to the next layer unchanged).

In practice we would like to find the best solution with a
cardinality of fixed size. Note that this can be cast as an
optimization problem where $p(\bar{x}_i \mid x_i, y)$ represent $k_{\bar{x}} \times
k_x \times k_y$ variables to optimize over if those are the respective
cardinalities of the variables. Then we can minimize
a nonlinear objective like $O = H(X_i \mid \bar{X}_i, Y) + I(\bar{X}_i; Y)$
over these variables. While off-the-shelf solvers will certainly
return local optima for this problem, the optimization
is quite slow, especially if we let the $k$'s get big.

For the results in this paper, instead of directly solving the
optimization problem above to get a representation with
cardinality of fixed size, we first construct a perfect solution
without limiting the cardinality. Then we modify that solution
to let either (i) or (ii) grow somewhat while reducing
the cardinality of $\bar{X}_i$ to some target. To keep $I(\bar{X}_i; Y) = 0$
while reducing the cardinality of $\bar{X}_i$, we just pick the $\bar{x}_i$
with the smallest probability and merge it with another
value for $\bar{X}_i$. On the other hand, to reduce the cardinality
while keeping $H(X_i \mid \bar{X}_i, Y) = 0$, we again start by finding
the $\bar{x}_i = k$ with the lowest probability. Then we take the
probability mass for $p(\bar{x}_i = k \mid x_i, y)$ for each $x_i$ and $y$ and
add it to the $p(\bar{x}_i \ne k \mid x_i, y)$ that already has the highest
likelihood for that $x_i, y$ combination. Note that $I(\bar{X}_i; Y)$
will no longer be zero after doing so. For both of these
schemes (keeping (i) fixed or keeping (ii) fixed) we reduce
cardinality until we achieve some target. For the results in
this paper we always picked … + 1 as the target and
we always used the strategy where (ii) was satisfied and we
let (i) be violated. In cases where perfect remainder information
is impractical due to issues of finite data, we have to
define “good remainder information” based on how well it
preserves the bounds in Thm. 2.1. The best way to do this
may depend on the application, as we saw in Sec. 6.
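The perfect-remainder construction of this appendix (before any cardinality reduction) can be sketched as an interval-splitting procedure, together with a check of conditions (i) and (ii). This is an illustrative reading of Fig. B.1, not the released code: the function names `perfect_remainder` and `remainder_objective`, the array layouts, and the rounding of split points to merge floating-point duplicates are all assumptions.

```python
import numpy as np

def perfect_remainder(p_x_given_y, decimals=12):
    """Build p(xbar) and p(xbar | x, y) from p_x_given_y[j, k] = p(x = k | y = j)."""
    p = np.asarray(p_x_given_y, dtype=float)
    k_y, k_x = p.shape
    order = np.argsort(-p, axis=1, kind="stable")          # descending likelihood per y
    # Right edges of the stacked histogram bars for each y, on the interval [0, 1].
    edges = np.round(np.cumsum(np.take_along_axis(p, order, axis=1), axis=1), decimals)
    edges[:, -1] = 1.0
    # Every split in any histogram becomes a split for xbar.
    cuts = np.unique(np.concatenate([[0.0], edges.ravel()]))
    p_rem = np.diff(cuts)                                  # p(xbar = q): cell widths
    mids = 0.5 * (cuts[:-1] + cuts[1:])
    cells = np.arange(len(p_rem))
    cond = np.zeros((len(p_rem), k_x, k_y))                # p(xbar = q | x = k, y = j)
    for j in range(k_y):
        # k(j, q): the value of x whose bar lies directly above cell q for y = j.
        k = order[j, np.searchsorted(edges[j], mids, side="left")]
        cond[cells, k, j] = p_rem / p[j, k]
    return p_rem, cond

def remainder_objective(p_xy, cond):
    """Return (I(Xbar; Y), H(X | Xbar, Y)) in bits, with p_xy[k, j] = p(x = k, y = j)."""
    joint = cond * np.asarray(p_xy, dtype=float)[None]     # p(xbar, x, y)
    p_qy = joint.sum(axis=1)                               # p(xbar, y)
    indep = p_qy.sum(axis=1, keepdims=True) * p_qy.sum(axis=0, keepdims=True)
    m = p_qy > 0
    mi = float((p_qy[m] * np.log2(p_qy[m] / indep[m])).sum())
    given = np.broadcast_to(p_qy[:, None, :], joint.shape)
    m3 = joint > 0
    cond_ent = float(-(joint[m3] * np.log2(joint[m3] / given[m3])).sum())
    return mi, cond_ent
```

The two returned quantities are the terms of the objective $O$ discussed under cardinality control, so the same helper can be used to measure how far a merged, lower-cardinality remainder departs from the perfect one.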

D More MNIST results 

Fig. B.2 shows the same type of results as Fig. 7 but using 
test data that was never seen in training. Note that no labels 
were used in any training. 

There are several plausible ways to generate new, never before
seen images using the sieve. Here we chose to draw the
variables at the last layer of the sieve randomly and independently
according to each of their marginal distributions
over the training data. Then we inverted the sieve to recover
hallucinated images. Some example results are shown in
Fig. D.1.

Figure D.1. An attempt to generate new images using the sieve.





